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Abstract 

The genetic basis of multiple phenot5rpes such as gene expression, metabolite levels, or 
imaging features is often investigated by testing a large collection of h5rpotheses, probing the 
existence of association between each of the traits and hundreds of thousands of genotyped 
variants. Appropriate multiplicity adjustment is crucial to guarantee replicability of findings, 
and False Discovery Rate (FDR) is frequently adopted as a measure of global error. In the 
interest of interpretability, results are often summarized so that reporting focuses on variants 
discovered to be associated to some phenotypes. We show that applying FDR-controlling 
procedures on the entire collection of h5rpotheses fails to control the rate of false discovery 
of associated variants as well as the average rate of false discovery of phenot5pes influenced 
by such variants. We propose a simple hierarchical testing procedure which allows control of 
both these error rates and provides a more reliable basis for the identification of variants with 
functional effects. We demonstrate the utility of this approach through simulation studies 
comparing various error rates and measures of power for genetic association studies of mul¬ 
tiple traits. Finally, we apply the proposed method to identify genetic variants which impact 
flowering phenot5rpes in Arabdopsis thaliana, expanding the set of discoveries. 


Introduction 

Biotechnological progress has enabled the routine measurement of thousands of phenotypes that 
were beyond the reach of precise quantification just a couple of decades ago. Together with the 
reduced costs of genotyping and sequencing, this motivates research into the genetic basis of an 
unprecedented number of traits. Examples include eQTL studies H]-^ that investigate the role 
of genetic variation on the expression of tens of thousands of genes; genome-wide metabolomics 
studies BH O that consider genetic influences on the levels of hundreds of metabolites; and 
proteomics studies investigating genetic regulation of protein abundances IIH IZl- At a more 
macroscopic level, neuroimaging genetics 181 aims to identify DNA variants influencing brain 
structures, described in thousands of voxels. Looking at even higher-level phenotypes, a number 
of large cohorts with rich phenotypic information have been or are being genotyped and will be 
used to map multiple traits. Notable examples are the Kaiser Permanente Research Program on 
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Genes, Environment, and Health (RPGEH) |9| that has already genotyped 100,000 subjects with 
complete medical records, and the Million Veterans Program IflOl that is aiming to genotype a 
million veterans with available health records. 

Investigating the genetic basis of thousands of traits simultaneously offers exciting possibil¬ 
ities, including the hope that a comprehensive and multifaceted description of the health status 
of a subject can provide a strong foundation for understanding relevant genetic underpinnings. 
Gapitalizing on these possibilities requires appropriate statistical approaches to address the chal¬ 
lenges posed by these novel data sets. Here, we focus on one such problem: namely, the devel¬ 
opment of multiple-testing procedures to identify discoveries while controlling an appropriate 
measure of error. Two choices need to be made up-front: (1) what notion of error to control; and 
(2) what is to be considered a discovery. We discuss these at the beginning of our study. In what 
follows, the terms 'trait' and 'phenotype' are used interchangeably; similarly, and with a slight 
abuse, 'SNP' (Single Nucleotide Polymorphism) and 'variant' are considered synonymous. 

The genetics community has been acutely aware of the necessity of accounting for the "look 
across the genome" effect. Even before genome-wide linkage (or association) studies were a 
possibility, sequential test procedures HTTH and Bayesian arguments IIT^ led to the adoption of 
very stringent significance cut-offs. Once large marker panels became available and multiple¬ 
testing became a reality, efforts focused on controlling the probability of making at least one 
erroneous finding, a criteria known as the famil}rwise error rate (EWER) IIT^ IT^ . This is well 
suited to investigate the genetic basis of a single disease assumed to be substantially influenced 
by one or two loci, especially when following up a hit implies years of work. The nature of 
present-day multi-trait investigations, however, is substantially different: when one explores the 
genetic basis of tens of thousands of traits, as in eQTE studies, insisting on not making even one 
mistake is overly severe. Indeed, on the heels of the experience in analysis of gene expression data 
OS SSI, in eQTE and other -omics investigations, another more liberal criteria has emerged as 
the dominant paradigm: the false discovery rate (EDR) IflTl . The EDR is defined as the expected 
proportion of findings that are erroneous, meaning that they correspond to situations where the 
null hypothesis is actually true. The present work adopts the point of view that such a criteria 
better reflects the goals of multi-phenotype studies where one expects to make a sizable number 
of discoveries, and it is acceptable to have a few false leads as long as these represent a small 
proportion of the findings ll^fTSl. 

In order to control EDR one needs to define a discovery. What constitutes an interesting 
finding? the identification of a variant that influences a specific phenotype? the determination 
that there is a genetic component to the variability of a trait? the discovery that one DNA variant 
is not neutral? all of the above? in which order of importance? To resolve these questions it 
is useful to look at the typical multi-phenotype genome-wide association study (GWAS): this 
consists in testing the hypothesis H^t of no association between variant v and trait t for all val¬ 
ues of V and t. This rather simplistic approach is often preferred for its limited computational 
cost, its robustness to missing data, and—most importantly—the ease with which results on dif¬ 
ferent phenotypes and SNPs can be compared across different studies. The collection of tested 
hypotheses {Hj^t v = 1 ,... ,M; t = 1 ,... ,P} can be considered as a single group, but it is also 
quite natural to identify sub-groups of hypotheses that address one specific scientific question, 
technically referred to as families. Note that—following the convention in multiple comparison 
literature—^we here use the term 'family' to indicate a collection of hypotheses rather than a 
group of related individuals; pedigrees do not play a role in the discussion. One can consider 
the families Vt = {H^t v = 1 ,.. .,M} of all hypotheses related to the phenotype f, addressing 
the existence of a genetic basis for the fth trait. Alternatively, one can focus on the families 
f = 1,..., P} of all hypotheses involving SNP v, investigating the phenotypic effect of 
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Table 1: Example of structured hypotheses: the 40 hypotheses Hu,.. .,Hs 5 are grouped into 
families Bold hypotheses are false null, and starred hypotheses correspond to rejec¬ 

tions. 

each genetic variant v. To these families we can associate global null hypotheses: Hy, = 
signifies that variant v does not affect any trait, while states that trait t is not 

influenced by any variant. Identifying a relevant family structure is important both because fam¬ 
ilies are the appropriate universe for multiplicity adjustment and because they define discoveries. 
Ultimately this choice is study specific, but here we make one both in the interest of concreteness 
and to underscore a viewpoint that is often relevant. In most multi-phenotype GWAS, scientists 
have solid reason to believe that the traits under investigation have a genetic underpinning, so 
rejecting H,t would not represent an interesting discovery. In contrast, we expect most genetic 
variants to have no effect on any trait, so identifying those that are 'functional' can arguably 
be considered the most important discovery of multi-phenotype investigations. Consider, for 
example, eQTL studies: the discovery of SNPs that influence the expression of some genes is 
important as they are considered potential candidates for association with a variety of other 
medically relevant traits. Indeed, the reported results from multi-phenotype GWAS tend to be 
organized in terms of associated variants. In what follows, then, we consider the hypotheses 
{Hjjt V = 1,.. .,M; t = 1,.. .,P} as organized in M families defined by variants, and we 
identify the rejection of Hy, as an important discovery. Once a decision has been made that the 
hypotheses under consideration can be grouped in different families, it becomes relevant and 
meaningful to talk about a variety of global error measures, as we are about to describe. 


Material and Methods 

Global error measures for structured hypotheses 

We start by considering one simple example where we assume that we know the true status 
of the hypotheses and we can measure the realized False Discovery Proportion (FDP). Table 
presents a total of 40 hypotheses, relative to 8 phenotypes and 5 variants, which define families 
Tv, V = 1,... ,5. We use bold to indicate hypotheses that are false null (where signal/association 
is present) and asterisks to indicate hypotheses that are rejected. A variant is discovered if the 
corresponding family contains at least one rejected hypothesis. In Table there are a total of 
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10 individual hypotheses rejected and two of these are true nulls: the global false discovery 
proportion (gPDP) equal to 2/10. Pamilies T\,T 2 ,andTj, are discovered, but all the hypotheses 
in are true nulls: the proportion of false family discoveries is 1/3. The average PDF (aPDP) 
across all families is 0.23 = (0 + l + l/6 + 0 + 0)/5; but if we focus only on families that have 
been discovered, the average PDF across selected families (sFDP) is 0.38 = (0 + 1 + l/6)/3. 

With this example in mind, we can define a variety of error rates. Let P indicate the col¬ 
lection of p-values associated with all the individual hypotheses tested. Let 5(P) be a selection 
procedure (which can depend on the observed p-values) that identifies interesting variants. Let 
R be the total number of rejections and F the total number of erroneous rejections across all 
hypotheses. Similarly, F, and R; count the false discoveries and total discoveries in family i. We 
say that variant i is discovered if the corresponding global null Hi, is rejected. We indicate with 
R^ and F^, respectively, the total number of rejections and the total number of false discoveries 
among the M global hypotheses Ff„s, probing the role of variant i. 

• gPDR = E(y/ max{R, 1}) is the global FDR that ignores the division of the hypotheses into 
families. 

• FDR; =E(V;/ max{R;, 1}) is the FDR within family i. 

• aPDR = E(L, / max{R„ 1}) is the average of the within-family FDRs. 

• sFDR = E;g 5 (p) (Vj/max{R;, 1})] is the expected value of the average of the 

within-family FDPs, where the average is taken only across families that have been selected. 

• vFDR = E(l/‘’/ max{R^, 1}) is the FDR for the discovery of variants (families). 

Focusing on f[F > 0] rather than F/ max{R, 1} and the appropriate modifications of these, one 
can define the corresponding set of FWERs. 

Given these alternatives, what error rate is relevant and important to control? The gFDR is 
a natural first choice as this is the error rate we would control if we had not identified a family 
structure among our hypotheses. Despite the appeal of its simplicity, there are caveats to be 
considered when targeting gFDR. As shown eloquently in Efron HT^ , pooling hypotheses from 
multiple families that have different proportions of true nulls and controlling gEDR can result in 
rather sub-optimal behavior: for families that contain none or very few false nulls, EDR, will not 
be controlled at the desired level, while families with many false nulls will encounter a loss in 
power. If one targets EDR, for each family separately, these difficulties are overcome but at the 
price of a large number of false discoveries: while aEDR would be controlled, gEDR and sEDR 
would not. In addition, if we consider Hj, rejected as long as one of Hu, t = 1,..., F is, it is 
important to note that neither of the two strategies above controls vEDR or sEDR. 

To illustrate these characteristics, we run a simulation with 300,000 hypotheses correspond¬ 
ing to P=100 phenotypes and M=3000 variants. Families are defined by variants and contain 
only true null hypotheses, with the exception of 60 variants each associated to 25 phenotypes. 
F-values corresponding to the true null hypotheses are generated independently from a uniform 
distribution on the [0,1] interval. Test statistics for the false null hypotheses are generated inde¬ 
pendently from the Af{2,cr^) distribution, and the corresponding p-values are computed as the 
two-tailed p-values under the distribution. Since larger values of the standard devia¬ 

tion cr make these two distributions more difficult to distinguish, we can interpret a as the noise 
level. Figure 1 shows a set of global error measures as the noise level increases. We also provide 
two measures of power: gPower represents global power, and vPower represents power to detect 
variants associated to at least one phenotype. We compare three approaches for the analysis 
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Figure 1: Global error achieved by different multiple comparisons controlling strategies as noise 
level increases. M=3000; P=100; 60 variants are associated to 25 phenotypes and the rest have 
no association. The lines indicate the average and the shaded areas the standard error over 250 
iterations. 

of the data sets: (a) the Benjamini-Hochberg (BH) method lUTI applied to the pooled collection 
of all p-values with target level q = 0.05 for gFDR ("pooled BH"); (b) BH applied to each fam¬ 
ily separately with target level q = 0.05 for each FDR, ("per family BH"); and (c) a hierarchical 
strategy we will discuss in the following section and included here for reference ("hierarchical 
BH"). Figure illustrates how both (a) "pooled BH" and (b) "per family BH" control their target 
error rate (gFDR and aFDR, respectively), but not vFDR or sFDR. When (a) BH is applied to the 
entire collection of hypotheses, the false rejections are uniformly distributed across the true null 
hypotheses; in a context where many variants affect no phenotypes, this results in false variant 
discoveries. Furthermore, once we restrict attention to the families with at least one rejection, 
many have a within-family FDP close to 1: we do not have control of the error we make when 
declaring association between phenotypes and the selected SNPs. 

If we apply BH in a per family manner (b), the aPDR is controlled: many families lead 
to no discoveries, resulting in a FDP equal to 0, which lowers the average FDR. However, the 
families associated with discovered variants tend to have very large FDP: neither sFDR or gFDR is 
controlled. From a certain point of view, applying BH to each family separately can be considered 
as ignoring the multiplicity due to different variants, so it is not surprising that vFDR and gFDR 
are quite high with this approach. In summary, (b) does not appear to be a viable strategy 
whenever M is large. We now introduce procedure (c) that overcomes this impasse. 
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Figure 2: Hierarchical structure of hypotheses. 

Hierarchical testing procedure 

Benjamini and Bogomolov |[20l describe how to control sFDR when families are selected accord¬ 
ing to a rather broad set of criteria. Here, we build upon their work and suggest selecting families 
so as to control the vFDR: this allows us to provide both guarantees on the discovered variants 
and on the identification of the phenotypes they influence. To avoid unnecessary complexity we 
assume that each family contains the same number of hypotheses, although this is not necessary. 

We aim to control FDR on the collection of M global null hypotheses Hy, = 

{Hi,, V = \,..., M} at level q-i. Once a set of interesting families {Tv, c* € 5} has been identified 
by controlling the vFDR, we aim to control the sFDR, that is the average FDR on the selected 
families, at level q 2 . 

Testing is carried out on the basis of the p-values 'pvi obtained for each of the individual 
hypotheses Hyt- The p-values for the intersection hypotheses Hy, are defined as the Simes's 
p-values H2T I for the respective families: 


Pd. 


min 


^Pv(t) 

t 


( 1 ) 


where py(^t) represents the fth ordered element of the vector [pvt, t = 1,P} . The hierarchical 
procedure is as follows: 

Testing Procedure 1 

Stage 0 Use Simes's method to obtain p-values py,sfor the intersection hypotheses Hy,s. 

Stage 1 Apply BH to the collection of p-values [pv,] = 1,.. .,M} with an FDR target level qi. Let 
iS(P) indicate the set ofv corresponding to rejected hypotheses Hy,. 

Stage 2 Proceed to test the individual hypotheses Hyt only in families Ty with v G iS(P). Within such 
families, apply BH with target level q 2 x the appropriate adjustment for the selection bias 

introduced in Stage 1. 

Testing Procedure 1 guarantees vFDR control when the Simes's p-values are valid p-values 
for the intersection hypotheses and when BH applied to {py,,v = 1,...,M} controls FDR. It 
also guarantees control of sFDR when BH applied to each family Ty controls FDR within the 
family and the p-values in each family are independent of the p-values in any other family. 
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or when the pooled set of p-values satisfies a certain positive dependence property (see later for 
more details regarding the control of vFDR and sFDR of Testing Procedure 1 under dependence). 
Figure illustrates how the hierarchical procedure controls vFDR and sFDR in the setting of the 
simulation described in the previous section. In the remainder of this paper, we will explore in 
some detail when conditions for Testing Procedure 1 to control its target error rate are satisfied 
and how applicable they are to the tests we encounter in GWAS with multiple phenotypes. First, 
however, some remarks are useful. 

• In Stage 0, we suggested using Simes's p-value for three reasons: it can be easily constructed 
from the single hypothesis p-values; it is robust to most common types of dependence be¬ 
tween the test statistics in the family Il22l |23l; and, finally, its combination with BH leads 
to consistent results between stages, as will be discussed in more detail later. However, 
other choices are possible and might be more effective in specific situations. For example, 
when the tests across phenotypes can be considered independent, it might be advanta¬ 
geous to combine p-values using Fisher's rule [241: this might lead to the identification of 
SNPs that have a very modest effect on multiple phenotypes, so that their influence can 
only be gathered by combining these effects. If appropriate distributional assumptions are 
satisfied, another choice might be the Higher Criticism statistic [251 . Finally, one might 
obtain a p-value py, for the intersection hypothesis by means other than the combination 
of the p-values for individual hypotheses. For example, one can use a reverse regression 
approach as in [261 , in which a regression is fit for each genetic variant treating the full set 
of phenotypes as the predictors and the SNP genotype as an ordinal response. 

• Stage 1 focuses on the discovery of interesting families which correspond to genetic vari¬ 

ants associated with variability in phenotypes: a multiplicity adjustment that controls the 
desired error rate on = 1,... ,M} needs to be in place. For FDR control we rely on 

BH, which has been shown to perform well under the types of dependence across mark¬ 
ers present in the GWAS setting |j2l- The more conservative Benjamini-Yekutieli procedure 
Il27ll , with its theoretical guarantees, is also possible. Some might prefer to control FWER at 
this level via a Bonferroni procedure: this would be in keeping with the criteria routinely 
adopted in genome-wide association studies. In the simulations that follow we explore the 
properties of this approach as well. 

• Stage 2 identifies phenotypes associated with interesting SNPs. It rests on the results in ||20| : 
to control the average error rate across the selected families at level q 2 , one has to perform 
a multiplicity adjustment within each family at a more stringent level q 2 x to account 
for the selection effect. Again, this result is more general than implied in Testing Procedure 
1. For example, one might want to control the average FWER across selected families: this 
would be possible by using Bonferroni at the appropriate level. It is useful to observe the 
interplay of selection penalty and Bonferroni correction. If only one family is selected, the 
threshold for significance is the same that would result from applying Bonferroni to the 
entire collection of hypotheses. If all families are selected, the threshold for significance is 
simply and there is no price for multiplicity across families. When more than one family 
is selected, the threshold is between these two. In general, it can be shown that controlling 
the average FWER across selected families is more liberal than controlling global FWER. 
It is not possible to make such a general statement with respect to FDR, but it remains 
true that the hierarchical procedure has the potential of increasing power by reducing the 
multiple comparisons burden via relevant selection of which hypotheses to test. 
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• Testing Procedure 1 controls sFDR in Stage 2 by controlling FDR within each selected family 
at a more stringent level. One interesting aspect of this approach is that BH is applied to 
each selected family separately: this allows for adaptivity to the family specific proportion 
of true nulls, overcoming one of the limitations of BH applied to the entire collection of 
hypotheses. 

• Stages 1 and 2 are governed by two separate testing procedures. Generally speaking, this 
could imply that the set of discoveries in the two steps are not in perfect correspondence: 
one could reject the intersection null hypothesis corresponding to a variant, but not reject 
any of the single hypotheses on the association between that variant and the individual 
phenotypes. The set-up of Testing Frocedure 1—^where p-values for the intersection hy¬ 
potheses are obtained with Simes's rule and Stages 1 and 2 use BH—assures that this is 
not the case whenever qi < q 2 ' as long as the global null corresponding to one variant is 
rejected, this variant is declared to be associated with at least one phenotype. 


Results 

Simulations with independent tests 

To illustrate the operating characteristics of the hierarchical procedure, we rely first on simula¬ 
tions with all tests independent. Exploration of typical GWAS dependence will be discussed in 
the next section. Figure [^summarizes the results of two scenarios: M= 3000, F=100 and in (A) 60 
variants are associated with 25 phenotypes (as in Figure j^, while in (B) 1500 variants are associ¬ 
ated with 5 phenotypes. F-values were generated as for Figure [^ Four strategies are compared: 
(a) gFDR control with BH ("pooled BH"); (b) Bonferroni targeting gFWER ("pooled Bonferroni"); 
(c) Testing Frocedure 1 ("hierarchical BH"); (d) hierarchical testing targeting vFWER, via Bon¬ 
ferroni applied on the Simes's p-values, and sFDR ("hierarchical Bonferroni"). The target for all 
error rates is 0.05. 

All procedures control their respective targeted error rates, and the two hierarchical proce¬ 
dures also control gFDR. The power of the hierarchical procedure that controls vFDR is compara¬ 
ble to that of applying BH to the entire collection of hypotheses, and the power of the procedure 
that targets vFWER is comparable to or better than that of Bonferroni on the entire collection. 
The hierarchical procedures show an advantage when the families with non-null hypotheses are 
a small subset of the total families. In such cases, BH applied to the pooled collection of p-values 
fails to control vFDR and sFDR. This is precisely the situation we expect to hold in GWAS: only 
a small proportion of SNFs are associated to any phenotype. The substantial increase in power 
of "hierarchical Bonferroni" over "pooled Bonferroni" in (A) is due to the adaptivity of BH to the 
proportion of false null hypotheses in the families: when a SNF is selected which has effects on 
multiple phenotypes, it becomes easier to detect these associations. 

Given that the relative advantages of the procedures we are considering depend on the 
number of families and the number of true null hypotheses they contain, we run a simulation 
with dimensions that should resemble that of a GWAS involving multiple traits: 100,000 SNFs 
and 100 phenotypes. In Figure [^ most of the families contain only true null hypotheses, except 
for 1000 variants that are associated with 25 phenotypes and 500 variants that are associated with 
one phenotype each. This last type of family is included both to account for phenotype specific 
effects and to evaluate the possible loss of power in detecting these variants for the hierarchical 
strategy: in addition to the global power (gFower), we report power to detect variants (vFower) 
and power to detect variants that affect only one phenotype (SingletonFower). 
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Figure 3: Error rates and power for four multiple-testing strategies. M=3000, P = 100 and test 
statistics are independent. In (A) 60 variants are associated with 25 traits each and in (B) 1500 
variants are associated with 5 phenotypes each. The solid lines show the average, the shaded 
areas represent the standard error over 250 iterations, and the dotted horizontal lines mark the 
level 0.05. 
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Figure 4: Error rates and power for four multiple-testing strategies. M=100,000, P=100, and 1000 
variants are associated with 25 traits and 500 variants are associated with 1 trait each. The lines 
show the average, the shaded areas represent the standard error over 250 iterations, and the 
dotted horizontal lines mark level 0.05. 
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As expected, simply applying BH to the entire collection of hypotheses results in a substan¬ 
tial increase of the vFDR and sFDR, with no substantial power advantage. Indeed, the overall 
power is better for the hierarchical strategy, even if this encounters a loss of power to detect SNPs 
that are associated with only one phenotype. The validity of these results is limited by the fact 
that those simulations were based on independent test statistics. In practice there are multiple 
sources of dependence and we now explore their effects on the hierarchical procedure. 

GWAS dependence structure 

The markers typed in GWAS are typically chosen to span the entire genome at a high density. 
SNPs in the same neighborhood are not independent, but in linkage disequilibrium. This redun¬ 
dancy assures that the typed markers can effectively act as proxies for untyped variants and is 
one of the sources of dependency relevant for our study. 

To understand other departures from independence, it is useful to look at the relationship 
between phenotypes and genotypes and the methods with which these are analyzed. In its 
simplest form, the data-generating model considered by geneticists to link each phenotype t to 
genotypes is yu = + C;, where C/ are uncorrelated and i indicates subjects. The coefficient 

vector )8 is thought to be sparse (that is with a small proportion of non-zero elements) or effec¬ 
tively sparse in the sense that a small portion of the coefficients have appreciable size. When 
considering multiple phenotypes and n subjects, this translates into 

Y = XB -F E, (2) 

where Y„xP/ B^^xP and E„xP are matrices containing phenotypes, genotypes, coefficients, 

and error terms, respectively. While most of the rows of B are full of zeros, some rows are 
expected to contain more than one non-zero element, corresponding to genomic locations that 
influence multiple phenotype (pleiotropy): the resulting phenotypes are not independent, even 
when the elements of the error matrix are iid. 

GWAS data is generally analyzed using a collection of univariate regressions linking each 
phenotype t to one genetic variant v: 

^[,t] = a + + ^it], (3) 

and the hypothesis Hut translates into H : fiut = 0, tested with the standard t-statistics. Glearly, 
the discrepancy between even the theoretical model (|^ and the regression ^ used for analysis 
leads to a number of consequences. For example, as the error terms Ejj cannot be expected 
to be uncorrelated across individuals, linear mix models are often used in single phenotype 
analysis Il28ll . Moreover, the combination of spatial dependence existing across SNPs and the 
univariate testing approach 0 induces spatial structure among both the test statistics and the 
hypotheses. Gonsider the case of a complete null where the phenotypes under study have no 
genetic underpinning. If by random chance one variant appears to have some explanatory power 
for one phenotype, the p-values of neighboring SNPs will also tend to have lower values—this is 
dependence among the test statistics. Gonsider now a data-generating model (|^ where variant 
V has a coefficient different from zero while its neighbors do not. With respect to model (j^ Hut 
is false and the Hit for neighboring SNPs I are true. However, once we decide to look at the data 
through the lenses of 0, the hypotheses Hit are redefined to mean the lack of any association 
between SNP I and phenotype t and—as long as SNP I can act as a reasonable proxy for one of the 
causal variants —Hit is false. We expect clusters of null hypotheses corresponding to neighboring 
SNPs to be false or true together. Indeed, in GWAS studies it is common to find a number 
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of nearby variants significantly associated with the trait: this is interpreted as evidence for the 
presence of one or more causal variants in the specific genomic region. Looking at multiple 
phenotypes that might share genetic determinants adds another layer to this phenomenon. 

On the one hand, dependence between test statistics can be problematic for multiplicity ad¬ 
justment strategies. The Bonferroni approach controls FWER even if tests are dependent; the 
Benjamini-Hochberg procedure, instead, is guaranteed to control FDR under independence or 
positive regression dependence on a subset (PRDS) ||27|, even if it has been empirically observed 
to provide FDR control under broader conditions. When the BH procedure controls FDR under 
the dependence of the p-values within each family and the p-values in each family are inde¬ 
pendent of the p-values in any other family, the Testing Procedure 1 controls vFDR and sFDR. 
Provided that certain overall positive dependence properties hold, these error rates remain con¬ 
trolled when the p-values across the families are not independent. In particular, when the pooled 
set of p-values is PRDS, sFDR is controlled (see Theorem 3 in Il20l : note that this is the same 
condition needed for pooled BH to control gFDR). In addition, it can be concluded from the 
simulation results of ||29] that when {pvt,v = 1,.. .,M} are PRDS for each f G {1,.. .,P}, and 
when {pvt, t = 1,... ,P} are PRDS for each r; G {!,..., M}, vFDR is controlled. 

On the other hand, the fact that tested hypotheses H-ot are defined with respect to Q rather 
than the data generative model (|^ makes it challenging to evaluate the error made by a multiple¬ 
testing procedure: if we use Q as ground truth, we expect many false rejections that really do not 
correspond to a mistake with reference to (|^. In order to avoid this problem, we will consider all 
the hypotheses relative to variants that are sufficiently close to a causal variant in the generative 
model as correctly rejected. 

For the simulations below we use genotype data obtained from 1966 Northern Finland Birth 
Cohort (NFBC) Il30l . We exclude copy number variants and markers with p-values for Hardy- 
Wernberg equilibrium below le-5, with minor allele frequency (MAF) less than 0.01, or with call 
rate less than 95%. This screening results in M = 334,103 SNPs on n = 5,402 subjects. We code 
SNPs by minor allele count and impute missing genotypes by average variant allele count. We 
simulate P = 100 traits. In each iteration, we select 130 SNPs at random and use them to generate 
phenotypes, as follows: the first 10 SNPs affect 50 phenotypes, the next 10 affect 25, the next 10 
affect 10 and the final 100 each affect 5 phenotypes, always chosen at random. In this set up, each 
trait reflects the contribution of 13.5 SNPs on average. The more than 300,000 SNPs remaining 
have no functional role. To generate the simulated traits, we follow the linear model in equation 
(|^ where B^t is 1 in presence of an association between variant v and trait t and 0 otherwise. 

Due to the large number of hypotheses under consideration, we rely on MatrixEQTL to 
allow efficient computation of the p-values of association. This software, originally designed for 
the analysis of eQTL data, utilizes large matrix operations to increase computational speed and 
has the option to reduce the required memory by saving only p-values beneath a given threshold. 
As long as this threshold is above the p-value cut-off for selection under all error control methods, 
this shortcut does not affect the results. In applying MatrixEQTL, we use a threshold of 5e-4 for 
saving output and include the first 5 principal components of the genotype data as covariates to 
adjust for the effects of population structure. 

Under varying levels of noise a, we compare four adjustment strategies studied before. 
When analyzing the results, we consider a discovery a true positive if it lies within 1Mb and 
has correlation at least 0.2 to the truly causal SNP The results, given in Figure show that 
even with this allowance, there are still settings where some of the methods under consideration 
fail to control their target error rates. In particular, pooled Bonferroni fails to control gPWER 
and hierarchical Bonferroni fails to control vFWER for settings with higher levels of power. In 
addition, gFDR is somewhat above 0.05 for pooled BH and vFDR exceeds 0.05 for hierarchical BH 
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Noise level 


Method 

Pooled BH 
Pooled Bonferroni 
Hierarchical BH 
Hierarchical Bonferroni 


Figure 5: Error rates and power for four multiple-testing strategies applied to simulated data 
starting from real genotypes. The lines show the average, the shaded areas report the standard 
error over 100 iterations, and the dotted horizontal lines mark the 0.05 level. 


in the setting with highest power. Rather than a failure of the multiple comparisons procedure, 
this is to be attributed to the confusion induced by the use of model ^ to analyze data generated 
with model Q; when we re-run the analysis using phenotypes adjusted for the effects of variables 
omitted by the univariate model, these errors appear appropriately controlled. EWER is more 
sensitive to these misspecification errors simply because one single mistake is enough to raise the 
realized FWE to 1; in contrast, as long as these mistakes are accompanied by a number of true 
discoveries, the realized false discovery proportion will only be marginally inflated. Focusing 
on the performance of hierarchical methods compared, we again conclude that they appear to 
control their targeted error rates whenever the corresponding pooled approach controls gFDR or 
gFWER. 

Case study: Flowering in Arabidopsis thaliana 

We use Testing Procedure 1 to re-analyze data on the genetic basis of flowering phenotypes in 
Arabidopsis thaliana, Il32l online at Il33ll . While the original study includes 109 different traits, 
we focus on 23 phenotypes related to flowering including days to flowering under different 
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Figure 6: Families with differential results from pooled BH and hierarchical BH for the case 
study. Discoveries under both methods are marked with •, discoveries made only under pooled 
BH are marked with and discoveries made only under hierarchical BH are marked with 


conditions, plant diameter at flowering, and number of leaves at flowering, etc.; the results in 
l32l indicate that a shared genetic basis is likely for at least some of these traits. Genotypes are 
available for 199 inbred lines at 216,130 SNPs. 

To obtain p-values of association, we follow the steps described in |j33: exclude SNPs with a 
MAF < 0.1, transform certain phenotypes to the log scale, and fit the variance components model 
implemented in [341, which allows us to account for population structure. The original analysis 
underscored the difficulties of identifying true positives only on the basis of statistical consider¬ 
ations and did not attempt formal multiplicity adjustment. While these challenges clearly still 
stand, here we compare the results of applying BH across the full set of p-values targeting gFDR 
at level 0.05, with those of Testing Procedure 1 targeting vFDR and sFDR, each at level 0.05. This 
means that for the hierarchical procedure, we have M = 216,130 families corresponding to SNPs, 
each consisting of 23 hypotheses. 

Hierarchical BH identifies 131 variants versus the 139 of pooled BH, reflecting a tighter 
standard for variant discovery. At the same time, hierarchical BH increases global power over 
pooled BH, resulting in a total of 174 discoveries versus 161: an increase of 8%. The variants that 
pooled BH discovers in excess of hierarchical BH are declared associated to one phenotype only. 
There are 7% fewer such SNPs according to the hierarchical procedure. Table [^presents variants 
with different results under the two methods: 8 SNPs discovered by pooled BH as associated with 
only one phenotype are not selected by hierarchical BH, while several SNPs discovered under 
pooled BH are associated to a larger number of phenotypes by hierarchical BH. For example, 
the SNP in column 1 of Figure corresponds to a particular location in the short vegetative 
phase (SVP) gene, that is known to be involved in flowering and associated to two additional 
phenotypes under the hierarchical method. 
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Discussion 


Contemporary genomic investigations result in testing very large number of hypotheses, making 
it vital to adopt appropriate strategies for multiplicity adjustment: the risk of lack of repro¬ 
ducibility of results is too high to be overlooked. When the collection of tested hypotheses 
has some structure, discoveries often occur at multiple levels and reports typically do not fo¬ 
cus on the rejection of hypotheses at the finest scales. In the hope of increasing both power and 
interpretability, scientists often attempt to outline an overall picture with statements that are sup¬ 
ported by groups of hypotheses. We considered one example of such situations: in genome-wide 
association studies concerning a large number of phenotypes the primary object of inference is 
often the identification of variants that are associated to any trait. 

The simulations presented make clear that in these settings it is necessary to identify what is 
to be considered a discovery and to perform a multiplicity adjustment that allows one to control 
measures of global error defined on the discoveries of interest. By adapting the work in ||20| , we 
outline one such strategy and explore its performance and relative advantages in the context of 
GWAS studies involving multiple phenotypes. 

Our hierarchical strategy aims at (a) identifying SNPs which affect some phenotypes (while 
controlling errors at this level) and (b) detecting which phenotypes are influenced by such SNPs 
(controlling the average error measure across selected SNPs). We consider two error measures: 
FDR and FWER. We show that while our strategy achieves these goals, applying FDR controlling 
rules (as BH) on the entire collection of hypotheses ("pooled BH") does not control the FDR of the 
discoveries in (a) and (b): whenever the reporting of results emphasizes these, other multiplicity 
adjustments need to be in place. On the other hand, the "hierarchical BH" procedure is not 
guaranteed to control the global FDR (gFDR) in general, but it effectively appears to do so in 
the situations we simulated. Applying Bonferroni to the pooled collection of hypotheses does 
control FWFR for the discoveries in (a) and sFDR for the discoveries in (b), but it is excessively 
conservative if these are the target error rates. Conversely, the "hierarchical Bonferroni" strategy 
does not control global FWFR. 

To complete this summary of results, we shall make a few remarks. First, while the appli¬ 
cation to GWAS studies has motivated us and guided the exposition of material as well as some 
specific implementation choices, it is important to note that Testing Procedure 1 is applicable to 
much broader settings. It simply rests on the possibility of organizing the entire collection of 
tested hypotheses in groups of separate families, each probing a meaningful scientific question. 

Secondly, it is worth noting that the hierarchical strategy represents one example of valid 
selective inference. More and more, as the modalities of data collection become increasingly com¬ 
prehensive rather than targeted, scientists tend to "look at the data first and ask questions later." 
In other words, initial exploratory analyses are used to identify possible meaningful patterns and 
formulate precise hypotheses for formal statistical testing. When this is the case, however, the 
traditional rules for determining significance are inappropriate and procedures that account for 
the selection effects are called for. The work of Benjamin! and Bogomolov Il20l that we adapt here 
is an important step in this direction. 

Moving on to the specific implications for multi-phenotype GWAS, the results of our sim¬ 
ulations using actual genotypes contribute to the debate on whether to choose FDR or FWFR 
as targeted error rate. The combination of correlation between SNPs and misspecification of the 
linear model that is routinely used in GWAS applications can result in the rejection of hypotheses 
of no association between a SNP and a phenotype even when the SNP has no causal effect and 
is reasonably far from any causal variants. In procedures that target FDR control, these "false" 
rejections are accompanied by a number of correct ones and their effect on the error rate is mod- 
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est. Conversely, the presence of even one such wrong rejection equates the realized FWE to one: 
this makes it very hard to really control FWER in situations other than global null. 

Because of the disparities in targeted error rates, it is difficult to contrast the power of the 
hierarchical and pooled strategies as this comparison is most meaningful across procedures that 
guarantee the same error level. However, it is of practical relevance to contrast the number and 
characteristics of true findings that a researcher can expect when adopting the pooled and the 
hierarchical procedure targeting the respective error rates at the 0.05 level. Both the BH strategies 
appear to control global FDR and our simulations indicate that overall power is quite similar: the 
pooled approach discovers more SNPs that truly affect a single phenotype and the hierarchical 
approach discovers more SNPs that affect multiple phenotypes. The same trend is evident in the 
real-data analysis. Note that the false discovery rate among SNPs that are declared associated 
with one phenotype by the pooled BH strategy can be very high. Both Bonferroni strategies 
control the FWER of SNP discoveries and the average FDR for SNP-phenotype associations across 
selected SNP: the hierarchical approach (which does not control global FWER) has greater power, 
once again thanks to the increased discovery of SNPs associated to multiple phenotypes. 

While we have not discussed this so far, it will not have escaped the attentive reader that 
the hierarchical procedure we propose can be applied in meta-analyses of GWAS studies of the 
same trait. In this setting, one typically has independence across studies and multiple powerful 
choices of p-value for the global nulls are available in Stage 0. The contribution of the hierarchical 
procedure in this context is in Stage2, where studies with significant association are identified. 

A final remark is in order with reference to the application of the proposed approach to 
multi-phenotype GWAS studies. In our simulations we have accounted for inter-marker depen¬ 
dence and dependence across phenotypes due to shared genetic causes. We have not explored 
the results of dependence across phenotypes due to environmental components. Gonsider eQTL 
studies where the traits are measurements of expression levels of multiple genes: it has been 
repeatedly observed that experimental batch effects can result in strong dependence between 
traits. If such correlation between phenotypes is present, it would be crucial to account for it in 
the method of analysis used to define p-values. In absence of this, it is quite possible that some 
of the environmental effects might be accidentally correlated with the genotype value of some 
of the SNPs in the study resulting in a number of false positives which would be exacerbated 
by the hierarchical approaches. Indeed, the procedures we outlined here are valid as long as the 
p-values used as input are accurate; obtaining such p-values is clearly of paramount importance, 
but the topic of another report. 
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